Classification: Probability Regressions

I. Ozkan

23 11 2020

Preliminary Readings

Learning Objectives

Binary Dependent Variable

\(\{(y_1, X_1),(y_2, X_2),...,(y_n, X_n)\}\)

Dependent variable:

\(y_i \in \{0,1\}\)

Independent variables:

\(X_i=(x_{i1},x_{i2},..,x_{ik})\).

\(E(Y\vert X_1,X_2,\dots,X_k) = P(Y=1\vert X_1, X_2,\dots, X_k)\)

and

\(P(Y = 1 \vert X_1, X_2, \dots, X_k) = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \dots + \beta_k X_k\)

Binary Dependent Variable: Linear Probability Model

  deny pirat hirat     lvrat chist mhist phist unemp selfemp insurance condomin
1   no 0.221 0.221 0.8000000     5     2    no   3.9      no        no       no
2   no 0.265 0.265 0.9218750     2     2    no   3.2      no        no       no
3   no 0.372 0.248 0.9203980     1     2    no   3.2      no        no       no
4   no 0.320 0.250 0.8604651     1     2    no   4.3      no        no       no
5   no 0.360 0.350 0.6000000     1     1    no   3.2      no        no       no
6   no 0.240 0.170 0.5105263     1     1    no   3.9      no        no       no
  afam single hschool
1   no     no     yes
2   no    yes     yes
3   no     no     yes
4   no     no     yes
5   no     no     yes
6   no     no     yes

First exploratory plot:

Binary Dependent Variable: Linear Probability Model

\(deny_i = \beta_0 + \beta_1 \times (P/I\ ratio)_i + u_i\)

Dependent variable:
deny
pirat 0.604***
(0.061)
Constant -0.080***
(0.021)
Observations 2,380
R2 0.040
Adjusted R2 0.039
Residual Std. Error 0.318 (df = 2378)
F Statistic 98.406*** (df = 1; 2378)
Note: *p<0.1; **p<0.05; ***p<0.01


t test of coefficients:

             Estimate Std. Error t value  Pr(>|t|)    
(Intercept) -0.079910   0.031967 -2.4998   0.01249 *  
pirat        0.603535   0.098483  6.1283 1.036e-09 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

\(\widehat{deny} = -\underset{(0.032)}{0.080} + \underset{(0.098)}{0.604} (P/I \ ratio)\)
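The fitted line can be read as a probability, which a short sketch makes concrete. The coefficients are copied from the coefficient test above; the function name `lpm_probability` is illustrative and not part of the original analysis, which was run in R.

```python
# Coefficients of deny ~ pirat, taken from the coefficient test above.
B0, B1 = -0.079910, 0.603535


def lpm_probability(pirat):
    """Fitted value of the linear probability model (not clipped to [0, 1])."""
    return B0 + B1 * pirat


# Raising the P/I ratio from 0.3 to 0.4 shifts the fitted probability by 0.1 * B1.
delta = lpm_probability(0.4) - lpm_probability(0.3)
print(round(delta, 4))

# Because the model is linear, fitted values can fall outside the unit interval.
print(lpm_probability(0.1), lpm_probability(2.0))
```

The second print illustrates the usual weakness of the linear probability model: for small or very large P/I ratios the fitted "probability" is negative or above one.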

Binary Dependent Variable

Dependent variable:
deny
pirat 0.559***
(0.060)
blackyes 0.177***
(0.018)
Constant -0.091***
(0.021)
Observations 2,380
R2 0.076
Adjusted R2 0.075
Residual Std. Error 0.312 (df = 2377)
F Statistic 97.760*** (df = 2; 2377)
Note: *p<0.1; **p<0.05; ***p<0.01

And the robust standard errors:


t test of coefficients:

             Estimate Std. Error t value  Pr(>|t|)    
(Intercept) -0.090514   0.033430 -2.7076  0.006826 ** 
pirat        0.559195   0.103671  5.3939 7.575e-08 ***
blackyes     0.177428   0.025055  7.0815 1.871e-12 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

\(\widehat{deny} = \, -\underset{(0.033)}{0.091} + \underset{(0.104)}{0.559} (P/I \ ratio) + \underset{(0.025)}{0.177} black\)

Probit Regression

In probit regression:

\(E(Y\vert X) = P(Y=1\vert X) = \Phi(\beta_0 + \beta_1 X_1 + \cdots + \beta_k X_k)\)

where \(\Phi(\cdot)\) is the cumulative distribution function of the standard normal distribution, \(\Phi(z) = P(Z \leq z) \ , \ Z \sim \mathcal{N}(0,1)\).

Dependent variable:
deny
pirat 2.968***
(0.386)
Constant -2.194***
(0.138)
Observations 2,380
Log Likelihood -831.792
Akaike Inf. Crit. 1,667.585
Note: *p<0.1; **p<0.05; ***p<0.01

Call:
glm(formula = deny ~ pirat, family = binomial(link = "probit"), 
    data = HMDA)

Coefficients:
            Estimate Std. Error z value Pr(>|z|)    
(Intercept)  -2.1941     0.1378 -15.927  < 2e-16 ***
pirat         2.9679     0.3858   7.694 1.43e-14 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 1744.2  on 2379  degrees of freedom
Residual deviance: 1663.6  on 2378  degrees of freedom
AIC: 1667.6

Number of Fisher Scoring iterations: 6

The deviance residuals reported with the model are defined as

\(d_i = \text{sign}(e_i) [-2(y_i \text{log}\hat{p_i} + (1 - y_i)\text{log}(1 - \hat{p_i}))]^{1/2}\)

where \(e_i = y_i - \hat{p_i}\), that is, the difference between the observed \(y_i\) (each \(0\) or \(1\)) and the predicted probability.

  y      p_hat           e
1 0 0.06199418 -0.06199418
2 0 0.07961583 -0.07961583
3 0 0.13783485 -0.13783485
4 0 0.10667109 -0.10667109
5 0 0.13014350 -0.13014350
6 0 0.06918917 -0.06918917
  y      p_hat          d
1 0 0.06199418 -0.3577684
2 0 0.07961583 -0.4073429
3 0 0.13783485 -0.5446254
4 0 0.10667109 -0.4749746
5 0 0.13014350 -0.5280663
6 0 0.06918917 -0.3786798
[1] 1663.585
[1] 1663.585
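As a check on the formula, the first few deviance residuals can be recomputed from the printed \(y_i\) and \(\hat{p_i}\) values. This is a minimal sketch; `deviance_residual` is an illustrative name, and only the three rows listed are used.

```python
import math


def deviance_residual(y, p_hat):
    """Deviance residual of one binary observation with fitted probability p_hat."""
    e = y - p_hat
    sign = 1.0 if e > 0 else (-1.0 if e < 0 else 0.0)
    log_lik = y * math.log(p_hat) + (1 - y) * math.log(1 - p_hat)
    return sign * math.sqrt(-2.0 * log_lik)


# (y, p_hat) pairs copied from the first three rows printed above.
rows = [(0, 0.06199418), (0, 0.07961583), (0, 0.13783485)]
residuals = [deviance_residual(y, p) for y, p in rows]
print([round(d, 7) for d in residuals])
```

Summing the squared residuals over all 2,380 observations gives the residual deviance, which is why the two values printed above coincide.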

\(\text{pseudo-}R^2 = 1 - \frac{logLik(\hat f^{}_{model})}{logLik(\hat f^{}_{null})}\)

where \(\hat f^{}_{null}\) denotes the model that contains only the constant term (intercept). This gives a way to assess the explanatory power of the model.

To obtain this from the model summary:

Deviance difference = 1744.2 - 1663.6 = 80.6 with df = 2379 - 2378 = 1, giving a p-value of \(2.78 \times 10^{-19}\).
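The test can be sketched from the rounded deviances in the summary. With one degree of freedom the chi-square tail probability has the closed form \(\text{erfc}(\sqrt{x/2})\), so no statistics library is needed. Because the inputs are rounded, the result is close to, but not identical to, the p-value quoted above. For ungrouped binary data the deviance equals \(-2\,logLik\), so the deviance ratio also gives the pseudo-\(R^2\).

```python
import math

# Rounded values from the glm summary above.
null_deviance = 1744.2
residual_deviance = 1663.6

# Likelihood-ratio statistic for adding pirat to the intercept-only model.
lr_stat = null_deviance - residual_deviance

# Chi-square upper tail with 1 degree of freedom: P(Z^2 > x) = erfc(sqrt(x / 2)).
p_value = math.erfc(math.sqrt(lr_stat / 2.0))

# McFadden pseudo-R^2 expressed through deviances (deviance = -2 * logLik here).
mcfadden = 1.0 - residual_deviance / null_deviance

print(lr_stat, p_value, round(mcfadden, 4))
```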

Probit Regression


z test of coefficients:

            Estimate Std. Error  z value  Pr(>|z|)    
(Intercept) -2.19415    0.18901 -11.6087 < 2.2e-16 ***
pirat        2.96787    0.53698   5.5269 3.259e-08 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

\(\widehat{P(deny\vert P/I \ ratio)} = \Phi(-\underset{(0.19)}{2.19} + \underset{(0.54)}{2.97} (P/I \ ratio))\)
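Probit coefficients are not probability changes, so it helps to evaluate the fitted function at specific points. The sketch below uses the estimates printed above and builds \(\Phi\) from `math.erf`; the function names and the evaluation points 0.3 and 0.4 are chosen for illustration.

```python
import math


def std_normal_cdf(z):
    """Standard normal CDF, Phi(z), written with the error function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))


def probit_probability(pirat, b0=-2.19415, b1=2.96787):
    """Predicted denial probability from the probit model deny ~ pirat."""
    return std_normal_cdf(b0 + b1 * pirat)


p_low = probit_probability(0.3)
p_high = probit_probability(0.4)
print(round(p_low, 3), round(p_high, 3), round(p_high - p_low, 3))
```

Unlike in the linear probability model, the same 0.1 increase in the P/I ratio produces a different change in probability depending on the starting point.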

Probit Regression

Dependent variable:
deny
pirat 2.742***
(0.380)
blackyes 0.708***
(0.083)
Constant -2.259***
(0.137)
Observations 2,380
Log Likelihood -797.136
Akaike Inf. Crit. 1,600.272
Note: *p<0.1; **p<0.05; ***p<0.01

z test of coefficients:

             Estimate Std. Error  z value  Pr(>|z|)    
(Intercept) -2.258787   0.176608 -12.7898 < 2.2e-16 ***
pirat        2.741779   0.497673   5.5092 3.605e-08 ***
blackyes     0.708155   0.083091   8.5227 < 2.2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

\(\widehat{P(deny\vert P/I \ ratio, black)} = \Phi (-\underset{(0.18)}{2.26} + \underset{(0.50)}{2.74} (P/I \ ratio) + \underset{(0.08)}{0.71} black)\)
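To see what the coefficient on black means in probability terms, the fitted probit can be evaluated for two otherwise identical applicants. This is a sketch using the estimates above; the P/I ratio of 0.3 is an assumed example value and the function names are illustrative.

```python
import math


def std_normal_cdf(z):
    """Standard normal CDF, Phi(z)."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))


def probit_probability(pirat, black):
    """Predicted denial probability from the probit model deny ~ pirat + black."""
    b0, b_pirat, b_black = -2.258787, 2.741779, 0.708155
    return std_normal_cdf(b0 + b_pirat * pirat + b_black * black)


# Same P/I ratio, differing only in the black indicator.
gap = probit_probability(0.3, 1) - probit_probability(0.3, 0)
print(round(gap, 3))
```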

Logistic Regression (Logit Regression)

Probability and odds

\[\text{probability}=\frac {N_{Y=1}}{N}\]

\(P(X\leq2)=\frac{2}{6} \implies odds=\frac{2/6}{1-2/6}=\frac{2}{4}\)

\(\implies \text{odds}=\frac{\text{frequency of } Y=1 \, /N}{\text{frequency of } Y \neq 1 \, /N}\)

\(\implies \text{odds}=\frac{\text{probability of } Y=1}{1-\text{probability of } Y=1}\)

\(\implies \text{probability of } Y=1=\frac{\text{odds}}{1+\text{odds}}\)
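The two conversions can be checked with exact fractions, using the \(2/6\) example above. The function names are illustrative.

```python
from fractions import Fraction


def odds_from_probability(p):
    """odds = p / (1 - p)"""
    return p / (1 - p)


def probability_from_odds(odds):
    """p = odds / (1 + odds)"""
    return odds / (1 + odds)


p = Fraction(2, 6)
odds = odds_from_probability(p)
print(odds, probability_from_odds(odds))
```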

Logistic Regression

\(F(x) = \frac{1}{1+e^{-x}}\)

and the model is

\(\begin{align*} P(Y=1\vert X_1, X_2, \dots, X_k) =& \, F(\beta_0 + \beta_1 X_1 + \beta_2 X_2 + \dots + \beta_k X_k) \\ =& \, \frac{1}{1+e^{-(\beta_0 + \beta_1 X_1 + \beta_2 X_2 + \dots + \beta_k X_k)}}. \end{align*}\)

To see where this comes from, start from the log odds:

\(l=ln \left( \frac{P(Y=1)}{1-P(Y=1)} \right)= \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \dots + \beta_k X_k\)

\(\implies P(Y=1)= \frac{e^{\beta_0 + \beta_1 X_1 + \beta_2 X_2 + \dots + \beta_k X_k}}{1+e^{\beta_0 + \beta_1 X_1 + \beta_2 X_2 + \dots + \beta_k X_k}}=\frac{1}{1+e^{-(\beta_0 + \beta_1 X_1 + \beta_2 X_2 + \dots + \beta_k X_k)}}\)
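A numerical check of this derivation: the logistic function undoes the log odds, and its two algebraic forms agree. This is only a sketch; the names `logistic` and `log_odds` are illustrative.

```python
import math


def logistic(x):
    """F(x) = 1 / (1 + exp(-x))"""
    return 1.0 / (1.0 + math.exp(-x))


def log_odds(p):
    """l = ln(p / (1 - p))"""
    return math.log(p / (1.0 - p))


for p in (0.1, 0.5, 0.9):
    x = log_odds(p)
    # Alternative form exp(x) / (1 + exp(x)) from the derivation above.
    alt = math.exp(x) / (1.0 + math.exp(x))
    print(p, round(x, 4), round(logistic(x), 4), round(alt, 4))
```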

\(P(deny=1 \vert P/I \ ratio) = F(\beta_0 + \beta_1(P/I \ ratio))\)

and

\(P(deny=1 \vert P/I \ ratio, black) = F(\beta_0 + \beta_1(P/I \ ratio) + \beta_2 black)\)

Dependent variable: deny
                     (1)        (2)
pirat                5.884***   5.370***
                     (0.734)    (0.728)
blackyes                        1.273***
                                (0.146)
Constant             -4.028***  -4.126***
                     (0.269)    (0.268)
Observations         2,380      2,380
Log Likelihood       -830.094   -795.695
Akaike Inf. Crit.    1,664.188  1,597.390
Note: *p<0.1; **p<0.05; ***p<0.01

z test of coefficients:

            Estimate Std. Error  z value  Pr(>|z|)    
(Intercept) -4.02843    0.35898 -11.2218 < 2.2e-16 ***
pirat        5.88450    1.00015   5.8836 4.014e-09 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

z test of coefficients:

            Estimate Std. Error  z value  Pr(>|z|)    
(Intercept) -4.12556    0.34597 -11.9245 < 2.2e-16 ***
pirat        5.37036    0.96376   5.5723 2.514e-08 ***
blackyes     1.27278    0.14616   8.7081 < 2.2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Logistic Regression (Model Parameter Estimates)

\(\widehat{P(deny=1 \vert P/I \ ratio)} = F(-\underset{(0.36)}{4.03} + \underset{(1.00)}{5.88} (P/I \ ratio))\)

\(\widehat{P(deny=1 \vert P/I \ ratio, black)} = F(-\underset{(0.35)}{4.13} + \underset{(0.96)}{5.37} (P/I \ ratio) + \underset{(0.15)}{1.27} black)\)

            OddsRatio  2.5 %   97.5 %
(Intercept)     0.018  0.010    0.030
pirat         359.422 88.342 1565.122

            OddsRatio  2.5 %  97.5 %
(Intercept)     0.016  0.009   0.027
pirat         214.941 53.848 931.657
blackyes        3.571  2.675   4.747
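The odds ratios are the exponentiated logit coefficients. The sketch below reproduces the point estimates of the second model from the coefficients printed earlier; the confidence limits are not reproduced because they depend on how the interval was computed in the original R session.

```python
import math

# Logit coefficients of deny ~ pirat + black, copied from the output above.
coefficients = {"(Intercept)": -4.12556, "pirat": 5.37036, "blackyes": 1.27278}

# exp(beta) is the multiplicative change in the odds per unit change in the regressor.
odds_ratios = {name: math.exp(beta) for name, beta in coefficients.items()}

for name, value in odds_ratios.items():
    print(name, round(value, 3))
```

The value for blackyes says that, holding the P/I ratio fixed, the odds of denial are roughly 3.6 times higher when the indicator equals one.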

\(P(Y=1|(P/I \ ratio=0.3, black=1))= \frac{1}{1+e^{-(-4.13 + 5.37 \times 0.3 + 1.27)}} \approx 0.224\)

\(P(Y=1|(P/I \ ratio=0.3, black=0))= \frac{1}{1+e^{-(-4.13 + 5.37 \times 0.3)}} \approx 0.075\)

The difference between the two denial probabilities is \(0.149\).
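The same calculation can be sketched in code with the unrounded coefficients from the output above, which is what yields 0.224 rather than the 0.223 implied by the two-decimal values. The function name `denial_probability` is illustrative.

```python
import math

# Logit coefficients of deny ~ pirat + black, copied from the output above.
B0, B_PIRAT, B_BLACK = -4.12556, 5.37036, 1.27278


def denial_probability(pirat, black):
    """Predicted denial probability from the logit model."""
    z = B0 + B_PIRAT * pirat + B_BLACK * black
    return 1.0 / (1.0 + math.exp(-z))


p_black = denial_probability(0.3, 1)
p_other = denial_probability(0.3, 0)
gap = p_black - p_other
print(round(p_black, 3), round(p_other, 3), round(gap, 3))
```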

Logistic Regression

Logistic Regression, Extended Model

\(\begin{align*} lvrat = \begin{cases} \text{low} & \text{if} \ \ lvrat < 0.8, \\ \text{medium} & \text{if} \ \ 0.8 \leq lvrat \leq 0.95, \\ \text{high} & \text{if} \ \ lvrat > 0.95 \end{cases} \end{align*}\)
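The recoding rule can be written as a small function. This is a sketch of the rule only; the original recoding was presumably done in R, and `lvrat_category` is an illustrative name.

```python
def lvrat_category(lvrat):
    """Map a loan-to-value ratio to the three categories defined above."""
    if lvrat < 0.8:
        return "low"
    if lvrat <= 0.95:
        return "medium"
    return "high"


# The first values are lvrat entries from the data preview; 0.96 is a made-up value.
for value in (0.8, 0.921875, 0.6, 0.96):
    print(value, lvrat_category(value))
```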

Logistic Regression, Extended Model

Dependent variable: deny
                     OLS        logistic   probit     probit     probit     probit
                     (1)        (2)        (3)        (4)        (5)        (6)
blackyes             0.084***   0.688***   0.389***   0.371***   0.363***   0.246
                     (0.023)    (0.183)    (0.099)    (0.100)    (0.101)    (0.479)
pirat                0.449***   4.764***   2.442***   2.464***   2.622***   2.572***
                     (0.114)    (1.332)    (0.673)    (0.654)    (0.665)    (0.728)
hirat                -0.048     -0.109     -0.185     -0.302     -0.502     -0.538
                     (0.110)    (1.298)    (0.689)    (0.689)    (0.715)    (0.755)
lvratmedium          0.031**    0.464***   0.214***   0.216***   0.215**    0.216***
                     (0.013)    (0.160)    (0.082)    (0.082)    (0.084)    (0.083)
lvrathigh            0.189***   1.495***   0.791***   0.795***   0.836***   0.788***
                     (0.050)    (0.325)    (0.183)    (0.184)    (0.185)    (0.185)
chist                0.031***   0.290***   0.155***   0.158***   0.344***   0.158***
                     (0.005)    (0.039)    (0.021)    (0.021)    (0.108)    (0.021)
mhist                0.021*     0.279**    0.148**    0.110      0.162      0.111
                     (0.011)    (0.138)    (0.073)    (0.076)    (0.104)    (0.077)
phistyes             0.197***   1.226***   0.697***   0.702***   0.717***   0.705***
                     (0.035)    (0.203)    (0.114)    (0.115)    (0.116)    (0.115)
insuranceyes         0.702***   4.548***   2.557***   2.585***   2.589***   2.590***
                     (0.045)    (0.576)    (0.305)    (0.299)    (0.306)    (0.299)
selfempyes           0.060***   0.666***   0.359***   0.346***   0.342***   0.348***
                     (0.021)    (0.214)    (0.113)    (0.116)    (0.116)    (0.116)
singleyes                                             0.229***   0.230***   0.226***
                                                      (0.080)    (0.086)    (0.081)
hschoolyes                                            -0.613***  -0.604**   -0.620***
                                                      (0.229)    (0.237)    (0.229)
unemp                                                 0.030*     0.028      0.030
                                                      (0.018)    (0.018)    (0.018)
condominyes                                                      -0.055
                                                                 (0.096)
I(mhist == 3)                                                    -0.107
                                                                 (0.301)
I(mhist == 4)                                                    -0.383
                                                                 (0.427)
I(chist == 3)                                                    -0.226
                                                                 (0.248)
I(chist == 4)                                                    -0.251
                                                                 (0.338)
I(chist == 5)                                                    -0.789*
                                                                 (0.412)
I(chist == 6)                                                    -0.905*
                                                                 (0.515)
blackyes:pirat                                                              -0.579
                                                                            (1.550)
blackyes:hirat                                                              1.232
                                                                            (1.709)
Constant             -0.183***  -5.707***  -3.041***  -2.575***  -2.896***  -2.543***
                     (0.028)    (0.484)    (0.250)    (0.350)    (0.404)    (0.370)
Observations         2,380      2,380      2,380      2,380      2,380      2,380
R2                   0.266
Adjusted R2          0.263
Log Likelihood                  -635.637   -636.847   -628.614   -625.064   -628.332
Akaike Inf. Crit.               1,293.273  1,295.694  1,285.227  1,292.129  1,288.664
Residual Std. Error  0.279 (df = 2369)
F Statistic          85.974*** (df = 10; 2369)
Note: *p<0.1; **p<0.05; ***p<0.01

Logistic Regression, Model Selection

Analysis of Deviance Table

Model: binomial, link: probit

Response: deny

Terms added sequentially (first to last)

            Df Deviance Resid. Df Resid. Dev  Pr(>Chi)    
NULL                         2379     1744.2              
black        1   80.762      2378     1663.4 < 2.2e-16 ***
pirat        1   69.136      2377     1594.3 < 2.2e-16 ***
hirat        1    0.758      2376     1593.5  0.384045    
lvrat        2   51.607      2374     1541.9 6.219e-12 ***
chist        1   86.333      2373     1455.6 < 2.2e-16 ***
mhist        1    4.328      2372     1451.2  0.037486 *  
phist        1   37.318      2371     1413.9 1.003e-09 ***
insurance    1  130.348      2370     1283.6 < 2.2e-16 ***
selfemp      1    9.887      2369     1273.7  0.001665 ** 
single       1    6.890      2368     1266.8  0.008670 ** 
hschool      1    6.899      2367     1259.9  0.008624 ** 
unemp        1    2.678      2366     1257.2  0.101747    
black:pirat  1    0.011      2365     1257.2  0.914784    
black:hirat  1    0.552      2364     1256.7  0.457674    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Logistic Regression, Models Can Be Compared

Analysis of Deviance Table

Model 1: deny ~ black + pirat + hirat + lvrat + chist + mhist + phist + 
    insurance + selfemp
Model 2: deny ~ black + pirat + hirat + lvrat + chist + mhist + phist + 
    insurance + selfemp + single + hschool + unemp
  Resid. Df Resid. Dev Df Deviance  Pr(>Chi)    
1      2369     1273.7                          
2      2366     1257.2  3   16.467 0.0009096 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
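The p-value in this comparison is the upper tail of a chi-square distribution with 3 degrees of freedom, evaluated at the deviance difference. For 3 degrees of freedom the tail has a closed form, which the sketch below uses so that only the standard library is needed; `chi2_sf_3df` is an illustrative helper, not a function from the original analysis.

```python
import math


def chi2_sf_3df(x):
    """Upper-tail probability of a chi-square variable with 3 degrees of freedom."""
    return math.erfc(math.sqrt(x / 2.0)) + math.sqrt(2.0 * x / math.pi) * math.exp(-x / 2.0)


# Deviance difference between Model 1 and Model 2, as reported in the table above.
deviance_difference = 16.467
p_value = chi2_sf_3df(deviance_difference)
print(round(p_value, 7))
```

A small p-value indicates that adding single, hschool and unemp jointly improves the fit beyond what chance alone would explain.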

Comparisons and Model Performance

McFadden - pseudo R²

For its interpretation, see:
https://stats.stackexchange.com/questions/82105/mcfaddens-pseudo-r2-interpretation

\(\text{pseudo-}R^2 = 1 - \frac{logLik(\hat f^{}_{model})}{logLik(\hat f^{}_{null})}\)

\(LL\) is always negative.

\(0.2< \text{pseudo-}R^2 <0.4\): good,
\(\text{pseudo-}R^2 >0.5\): very good.

For more detail, see the UCLA IDRE website.

fitting null model for pseudo-r2
         llh      llhNull           G2     McFadden         r2ML         r2CU 
-636.8470560 -872.0853045  470.4764969    0.2697422    0.1793669    0.3452950 
fitting null model for pseudo-r2
         llh      llhNull           G2     McFadden         r2ML         r2CU 
-628.6136766 -872.0853045  486.9432557    0.2791833    0.1850251    0.3561875 
fitting null model for pseudo-r2
         llh      llhNull           G2     McFadden         r2ML         r2CU 
-625.0644460 -872.0853045  494.0417170    0.2832531    0.1874522    0.3608598 
fitting null model for pseudo-r2
         llh      llhNull           G2     McFadden         r2ML         r2CU 
-628.3321632 -872.0853045  487.5062827    0.2795061    0.1852179    0.3565586 
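The reported measures can be recomputed from the two log-likelihoods and the sample size. The sketch below assumes the usual definitions: G2 as twice the log-likelihood gain, r2ML as the maximum-likelihood (Cox and Snell) measure, and r2CU as its rescaled (Cragg and Uhler, or Nagelkerke) version. The function name `pseudo_r2` is illustrative.

```python
import math


def pseudo_r2(llh, llh_null, n):
    """Pseudo-R^2 measures from fitted and null log-likelihoods and sample size n."""
    g2 = -2.0 * (llh_null - llh)
    mcfadden = 1.0 - llh / llh_null
    r2_ml = 1.0 - math.exp(-g2 / n)
    r2_cu = r2_ml / (1.0 - math.exp(2.0 * llh_null / n))
    return {"G2": g2, "McFadden": mcfadden, "r2ML": r2_ml, "r2CU": r2_cu}


# Values from the first block of output above; n = 2380 observations.
stats = pseudo_r2(-636.8470560, -872.0853045, 2380)
for name, value in stats.items():
    print(name, round(value, 7))
```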

Model Performansı

Classes are predicted by thresholding the fitted probability:

\(\begin{align*} \hat Y_i = \begin{cases} 1 & \text{if} \ \ \hat P(Y_i = 1|X_{i1}, \dots, X_{ik}) > 0.5, \\ 0 & \text{if} \ \ \hat P(Y_i = 1|X_{i1}, \dots, X_{ik}) \leq 0.5, \\ \end{cases} \end{align*}\)

The predicted classes \(\hat Y_i\) are then evaluated. The threshold can also be set to a different value using other criteria; one of these is information gain.

In our first model, which predicts the probability of denial from the payment-to-income ratio alone, a probability threshold of \(0.5\) gives a misclassification error of 0.1176,

and the confusion matrix (columns: actual, rows: predicted) is

     0   1
0 2089 274
1    6  11
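The misclassification error follows directly from the off-diagonal cells. This is a minimal sketch; the dictionary layout is an assumption made for illustration, keyed as (predicted, actual).

```python
# Confusion matrix from above, keyed as (predicted, actual).
confusion = {
    (0, 0): 2089, (0, 1): 274,
    (1, 0): 6,    (1, 1): 11,
}

total = sum(confusion.values())
# Misclassified cases are those where the prediction differs from the actual class.
errors = sum(count for (pred, actual), count in confusion.items() if pred != actual)
error_rate = errors / total
print(total, errors, round(error_rate, 4))
```

Almost all of the error comes from actual denials predicted as approvals, so a low overall error rate does not mean that denials are identified well.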

ROC Curve

\(TPR=\frac{TP}{TP+FN} \: and \: FPR=\frac{FP}{FP+TN}\)
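Applied to the confusion matrix of the first model at the 0.5 threshold, these definitions give a single point on the ROC curve. The sketch treats deny = 1 as the positive class; the variable names are illustrative.

```python
from fractions import Fraction

# Cell counts at threshold 0.5 (positive class: deny = 1).
tp, fn = 11, 274   # actual denials predicted as denied / as approved
fp, tn = 6, 2089   # actual approvals predicted as denied / as approved

tpr = Fraction(tp, tp + fn)  # true positive rate (sensitivity)
fpr = Fraction(fp, fp + tn)  # false positive rate (1 - specificity)
print(float(tpr), float(fpr))
```

Repeating this for every threshold between 0 and 1 and plotting TPR against FPR traces out the full ROC curve.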

\(\widehat{P(deny=1 \vert P/I \ ratio)} = F(-\underset{(0.36)}{4.03} + \underset{(1.00)}{5.88} (P/I \ ratio))\)

\(\widehat{P(deny=1 \vert P/I ratio, black)} = F(-\underset{(0.35)}{4.13} + \underset{(0.96)}{5.37} (P/I \ ratio) + \underset{(0.15)}{1.27} black)\)

More Than Two Categories (Multinomial), Ordered Categories (Ordinal)

For a brief discussion, see the following links:

Multinomial Regression

Ordinal Regression

Interval Regression